Abstract


Deep learning methods have shown great promise in many practical applications,
ranging from speech recognition and visual object recognition to text processing.
However, most current deep learning methods scale poorly to large applications,
forcing researchers and users to focus on small-scale problems with fewer
parameters.

In this paper, we consider a well-known machine learning model, deep belief
networks (DBNs), which have yielded impressive classification performance on a large
number of benchmark machine learning tasks. To scale up DBNs, we propose an
approach that uses the computing clusters in a distributed environment to train
large models, while the dense matrix computations within a single machine are
sped up using graphics processing units (GPUs). When training a DBN, each machine
randomly drops out a portion of the neurons in each hidden layer for each training
case, so that the remaining neurons learn to detect only features that are generally
helpful for producing the correct answer. Within our approach, we have developed
four methods to combine the outcomes from each machine into a unified model.

Our preliminary experiment on the MNIST handwritten digit database demonstrates
that our approach achieves a lower test error rate than the state of the art.

1 Introduction 

Deep learning methods [?] aim to learn a multilayer neural network that can extract the feature
hierarchies of the input data by maximizing the likelihood of its training data. Their promise largely
lies in the potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models
with millions of parameters. In recent competitions, deep learning methods have shown advantages
over nearest-neighbor-based (shallow) methods such as kernel methods [?] and ensemble
methods [?].

In this paper, we consider a well-known machine learning model, deep belief networks (DBNs),
that can learn hierarchical representations of their inputs. DBNs have been applied to a number of
machine learning applications, including speech recognition [?], visual object recognition [8], and
text processing [?], among others. In particular, DBNs are especially well suited to problems with
high-dimensional inputs, over which they can infer rich models with many hidden layers. For example,
when applied to images, a DBN can easily have tens of millions of free parameters, and ideally, we
would want to use millions of unlabeled training examples to richly cover the input space.

It has been demonstrated that increasing the scale of deep learning, with respect to the number
of training examples, the number of model parameters, or both, can drastically improve ultimate
classification accuracy [?]. Unfortunately, with most current algorithms, even training a
moderate-sized DBN can take weeks using a conventional implementation on a single CPU [?].
This is primarily due to the daunting computational requirements of DBN training: a large number
of parameters need to be trained on the available examples.






To address the DBN scalability problem, this paper proposes an approach to scaling up
deep belief networks (DBNs) by adapting the idea of random dropout. Random dropout, proposed
by Hinton et al. [13], was originally used to prevent complex co-adaptations on the training data on
a single processor. On each training case, each hidden unit is randomly omitted from the network
with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. By doing
so, many separate DBNs are trained and then applied independently to the test data to reduce the
prediction bias of a single DBN.
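
To make the dropout step concrete, the following is a minimal sketch of masking one hidden layer for a single training case. It is an illustration only, not the prototype's code: the function name `dropout_forward`, the list-based activations, and the seeded generator are assumptions introduced here.

```python
import random

def dropout_forward(activations, drop_prob=0.5, rng=random):
    """Zero each hidden activation independently with probability drop_prob.

    Hypothetical helper for illustration; returns the thinned activations
    and the 0/1 mask that was applied.
    """
    mask = [0 if rng.random() < drop_prob else 1 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask

# One training case: a fresh mask would be drawn for every case.
rng = random.Random(0)  # seeded only to make the sketch reproducible
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
dropped, mask = dropout_forward(hidden, drop_prob=0.5, rng=rng)
print(mask)
print(dropped)
```

Because a new mask is drawn per training case, each case effectively trains a different thinned network that shares weights with all the others.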

Our approach extends the random dropout idea to the distributed and parallel setting. Rather than
omitting a hidden unit with a probability of 0.5, our approach randomly drops out a portion of
hidden units on each processor on each training case. To combine the DBNs trained on each
processor, our approach offers four different ways (Section 3.2):

1. performing model averaging with all trained DBNs.

2. using a majority vote over the prediction results of each trained DBN for each test case.

3. (each processor) synchronously updating its parameters after fetching the needed
parameters from other processors.

4. (each processor) asynchronously fetching the computed parameters from other processors
and pushing its computed parameters to other processors.

As validated in our preliminary evaluation, by using random dropout, our approach outperforms the
state-of-the-art [?] DBN algorithms on the same data set, and has the potential to exhibit nearly
linear speedup with its parallel implementation.

This paper makes the following contributions: 

• Approach. We propose an approach to scale up deep belief networks using random
dropout [13] on large clusters (Section 3).

• Implementation. We implemented our approach in an open-source prototype, which is
publicly available at: http://deeplearning.googlecode.com

• Evaluation. We applied our approach to the MNIST dataset, and demonstrated its
effectiveness (Section 4).


2 Related Work 

Recently, many approaches have been developed to scale up machine learning algorithms within a
machine (via multithreading) and across machines (via message passing) [14][15][16][17]. Much of
the existing work focuses on linear, convex models, and takes distributed gradient computation as the
first step. Some other approaches relax synchronization requirements, exploring delayed gradient
updates for convex problems [?], or exploring lock-less asynchronous stochastic gradient descent
on shared-memory architectures (i.e., single machines) [?].

Another way to scale up machine learning algorithms is to provide better abstractions and
well-encapsulated computation tools. MapReduce and GraphLab [20] are two notable examples.
However, MapReduce, originally designed for parallel data processing, has a number of limitations
for training deep belief networks [20]. On the other hand, GraphLab [20] was designed for general
unstructured graph computations and does not exploit the computational efficiencies available in a
typical structured graph such as a deep belief network. Thus, it is still unknown whether the
abstraction of GraphLab can be used for training large-scale DBNs.

In the deep learning community, some work has been done to train relatively small models on a
single machine [?]. In general, training a many-layer model is computationally intensive. Thus, full
model parallelism as well as smart distributed optimization techniques are required. Recent years saw
a surge of interest in scaling up the training and inference algorithms used for DBNs [10][?] and
in improving applicable optimization procedures [?]. Existing approaches primarily fall into the
following two categories.

Approaches in the first category use graphics processors (GPUs) [?] to achieve significant
speedup for training moderate-sized DBNs. The use of GPUs has significantly reduced the
computation time of matrix operations, which dominate most of the computation cost of deep learning
algorithms. However, a known limitation of such GPU-based approaches is that the speedup will
be small when the model does not fit in GPU memory (typically less than a few gigabytes). Thus,
to effectively leverage a GPU, researchers often reduce the model size and the number of parameters
to alleviate the impact of insufficient GPU memory. While data and parameter reduction work
well for small problems (e.g., acoustic modeling for speech recognition [?]), they are less
attractive for realistic problems with a large number of examples and dimensions (e.g., high-resolution
images [?]).

Approaches in the second category use model parallelism to achieve scalability. For example,
DistBelief [23] is a notable framework that enables model parallelism across machines (via
message passing), with the details of parallelism, synchronization and communication managed by the
framework. Model parallelism under the DistBelief framework suffers from a very large communication
overhead due to the dense connections between layers of neurons. Data parallelism is also
supported in DistBelief by using multiple replicas of a model to optimize a single objective. However,
as pointed out by Hinton et al. [13], a large neural network, such as the one trained by DistBelief,
can still perform poorly on held-out test data if the relationship between the input and the correct
output is complicated and the network has enough hidden units to model it accurately. In such cases,
there will typically be many different settings of the weights that can model the training set almost
perfectly. Each of these weight vectors will make different predictions on held-out test data, and
almost all of them will do worse on the test data than on the training data because the feature detectors
have been tuned to work well together on the training data but not on the test data.

Our approach is inspired by the above-mentioned approaches, and aims to address their
limitations. With the goal of scaling up deep learning techniques to train very large DBNs, our approach
combines the intrinsic parallelism of ensemble learning algorithms with the random dropout
approach [13] to improve the generalization of neural networks. Using random dropout, our
approach trains a separate DBN (much smaller than the original DBN) on an individual (graphics)
processor in a large cluster, and then combines their results using four proposed methods. Compared
to existing approaches, our random dropout-based approach has several noticeable benefits. First, it
becomes possible to train a huge number of different networks in a reasonable time, since the
number of parameters to be trained on a single machine is much smaller than in the original DBN. Second,
our approach makes better use of modern GPU memory due to the reduced model size. Third,
transferring data between processors incurs less communication overhead.


3 Proposed Approach 

To train large DBNs, we propose an approach that supports distributed computation in neural
networks. At a high level, our approach consists of two steps: model parallelism (Section 3.1) and
model combination (Section 3.2). In the first step, model parallelism, our approach automatically
parallelizes computation in each machine using all available resources, and manages
communication, synchronization and data transfer between machines. In the second step, our approach supports
four different ways to combine results from each machine to form a unified DBN.

3.1 Model Parallelism 

Figure 1 illustrates the model parallelism step on two machines.

Combining DBNs:

1. Averaging weights

2. Majority vote over predictions

3. Synchronous weight update

4. Asynchronous weight update


Figure 1: An example of model parallelism of a 3-layer neural network on 2 machines. In this
example, each machine randomly drops out a portion of neurons and trains a separate DBN independently.
Later, our approach combines the trained DBNs using the four methods described in Section 3.2.

1: compute Δw {calculate the gradient}
2: fetch w from the weight queue
3: w' ← w − ηΔw
4: push w' to other machines' weight queues

Figure 2: The asynchronous parameter updating algorithm in each machine for each training case.
Each machine has a weight queue receiving the updated parameters from other machines. At the
beginning, each machine has the same replica of parameters.


It is worth noting that the computational needs of training a DBN on one machine depend on its
connectivity structure, and random dropout can significantly reduce the complexity of a DBN as
well as the number of parameters: dropping out 50% of the neurons at each layer can lead to a
reduction of 75% of the parameters (connection weights). In general, given a dropout probability p, our
approach permits model parallelism among at most 1 machines, with each machine updating a
disjoint portion of the weight matrix. This is fundamentally different from existing model parallelism
approaches [23]. For example, in the DistBelief [23] framework, a user needs to define the
computation that takes place at each machine in each layer of the model, while our approach distributes the
computation of training each DBN fully automatically to each machine. In addition, for a
framework like DistBelief [23], the complexity of a DBN is not reduced; rather, a DBN is partitioned (as a
graph) across the available machines, and each machine must communicate frequently (with the parameter
server) to update weights. Therefore, large models with high computational demands might
benefit from access to more CPUs and memory at the beginning, but will eventually be limited by the
point at which communication costs dominate. By contrast, in our approach, DBNs produced
by random dropout tend to be more amenable to extensive distribution than fully-connected
structures, given their less complex structures and lower communication requirements. This will also
help alleviate the bottleneck in which many machines wait for the single slowest machine to
finish a given phase of computation.
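
The 75% figure follows from the fact that a connection weight stays active only if the units at both of its ends are kept. The sketch below works through that arithmetic. The helper names are hypothetical, the layer sizes and the 20% input dropout are borrowed from Section 4.1, and two simplifications are assumed: biases are ignored and the output layer is never dropped.

```python
def weight_reduction(p_in, p_out):
    """Expected fraction of weights removed between two layers,
    given the dropout probability of each layer."""
    return 1.0 - (1.0 - p_in) * (1.0 - p_out)

def expected_weights(layer_sizes, dropout):
    """Expected number of active weights; dropout[i] is the drop rate of layer i."""
    total = 0.0
    for i in range(len(layer_sizes) - 1):
        keep = (1.0 - dropout[i]) * (1.0 - dropout[i + 1])
        total += layer_sizes[i] * layer_sizes[i + 1] * keep
    return total

sizes = [784, 500, 500, 2000, 10]      # network from Section 4.1
rates = [0.2, 0.5, 0.5, 0.5, 0.0]      # assumed: output units are never dropped
full = expected_weights(sizes, [0.0] * len(sizes))
thinned = expected_weights(sizes, rates)

print(weight_reduction(0.5, 0.5))      # 0.75 between two hidden layers
print(thinned / full)                  # fraction of weights still active overall
```

Under these assumptions the 75% reduction applies to hidden-to-hidden connections. Layers adjacent to the input or output shrink less, so the whole network keeps roughly 29% of its weights.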

3.2 Model Combination 

Our approach provides four ways to combine the trained DBNs from each machine:

1. Averaging the weights of the trained DBN in each machine.

2. Majority voting over the prediction results for the test data using the trained DBN in each
machine (our implementation breaks possible cycles by arbitrarily ranking the prediction
results, but this did not occur in our experiments).

3. Synchronously updating parameter weights during DBN training in each machine.

4. Asynchronously updating parameter weights during DBN training in each machine.

The first two ways of combining models are straightforward, and are omitted in this paper for
space reasons. We next describe how to update parameter weights (a)synchronously.
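
Although the paper omits the first two methods, a minimal sketch may help fix the idea. It assumes that every machine ends up with a weight vector of the same length and that predictions are integer class labels. The function names and the tie-breaking rule (smallest label wins) are assumptions made here, not details taken from the prototype.

```python
from collections import Counter

def average_weights(weight_sets):
    """Element-wise mean of flat weight lists, one list per machine."""
    n = len(weight_sets)
    return [sum(ws) / n for ws in zip(*weight_sets)]

def majority_vote(predictions):
    """Most common label for one test case.

    Ties are broken by taking the smallest label; this is an arbitrary
    ranking chosen for the sketch.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

# Two machines, two weights each.
print(average_weights([[1.0, 2.0], [3.0, 4.0]]))   # [2.0, 3.0]

# Three machines predict a digit for the same test image.
print(majority_vote([3, 3, 7]))                    # 3
```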

Figure 2 sketches the asynchronous parameter weight updating algorithm in each machine for each
training case. This lock-free asynchronous algorithm prevents each machine from waiting for others
to finish before proceeding to the next training case, at the cost of the data consistency of
parameters: it is possible that two machines update the same parameters simultaneously without an
explicit ordering. As a consequence, this asynchronous algorithm may introduce additional
stochasticity in training.
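
The four steps of the asynchronous algorithm can be sketched as follows. This is a simplified, single-process simulation under stated assumptions: the `Machine` class, the toy quadratic objective, and the round-robin scheduling are all hypothetical, and real machines would run concurrently. The point it illustrates is that a gradient may be computed on a stale replica before fresher weights are fetched.

```python
import queue

class Machine:
    """One worker with a local parameter replica and an incoming weight queue."""

    def __init__(self, w0, lr):
        self.w = list(w0)
        self.lr = lr
        self.inbox = queue.Queue()   # filled by the other machines

    def step(self, grad_fn, peers):
        dw = grad_fn(self.w)                     # 1: compute the gradient
        while True:                              # 2: fetch the newest w, if any
            try:
                self.w = self.inbox.get_nowait()
            except queue.Empty:
                break
        # 3: w' = w - eta * dw (dw may come from a stale replica)
        self.w = [w - self.lr * g for w, g in zip(self.w, dw)]
        for peer in peers:                       # 4: push w' to the other queues
            peer.inbox.put(list(self.w))

def grad(w):
    """Gradient of the toy objective 0.5 * sum(w_i ** 2)."""
    return list(w)

a = Machine([1.0], lr=0.1)
b = Machine([1.0], lr=0.1)   # same initial replica on every machine
for _ in range(10):          # machines take turns here; no real concurrency
    a.step(grad, peers=[b])
    b.step(grad, peers=[a])
print(a.w, b.w)
```

No machine ever blocks on another: an empty queue simply means the local replica is used as is.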



1: compute Δw {calculate the gradient}
2: fetch w from the parameter server {wait until all other machines finish updating w}
3: w' ← w − ηΔw
4: update w' on the parameter server {wait until all other machines finish updating w'}

Figure 3: The synchronous parameter updating algorithm in each machine for each training case.
Only the parameter server stores all weights; each machine fetches the needed weights from the
server before updating.


Figure 3 sketches the synchronous parameter weight updating algorithm in each machine. In this
algorithm, there is a central parameter server storing the weights of all parameters. At the end of each
mini-batch (a set of training cases), each machine sends a request to the parameter server to fetch the
needed parameter weights. If other machines are updating the requested parameters, this machine
needs to wait until all machines finish their updates. Compared to the asynchronous algorithm, this
algorithm eliminates possible data races and improves the data consistency of the parameters, but
introduces higher overhead.
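
A minimal sketch of the synchronous scheme is given below, assuming threads stand in for machines. `ParameterServer`, `worker`, and the use of `threading.Barrier` to separate the read and write phases of each mini-batch are choices made for this illustration. The fetch, subtract, and store steps are folded into one locked `apply` call so that no update is lost.

```python
import threading

class ParameterServer:
    """Central store for the weights; a lock serializes all access."""

    def __init__(self, w0):
        self.w = list(w0)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return list(self.w)

    def apply(self, dw, lr):
        # Fetch w, compute w' = w - eta * dw, and store w' atomically.
        with self.lock:
            self.w = [w - lr * g for w, g in zip(self.w, dw)]

def worker(server, barrier, grad_fn, lr, rounds):
    for _ in range(rounds):
        w = server.fetch()
        dw = grad_fn(w)
        barrier.wait()        # every machine has read w before anyone writes
        server.apply(dw, lr)
        barrier.wait()        # every machine has written before the next read

def grad(w):
    """Gradient of the toy objective 0.5 * sum(w_i ** 2)."""
    return list(w)

server = ParameterServer([1.0])
n_workers, rounds = 2, 5
barrier = threading.Barrier(n_workers)
threads = [
    threading.Thread(target=worker, args=(server, barrier, grad, 0.1, rounds))
    for _ in range(n_workers)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.w)
```

The two barrier waits per round are the source of the higher overhead: every machine idles until the slowest one has finished its phase.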

3.3 Implementation 

We implemented our approach in a prototype using Matlab and Python. Our prototype uses the
Theano [24] library to define, optimize, and evaluate mathematical expressions involving
multi-dimensional arrays efficiently. Specifically, our implementation uses Theano to achieve significant
speedup in data-intensive calculations by leveraging the transparent use of a GPU. When combining
the trained DBNs, our implementation uses inter-process communication (IPC) [25] to exchange
data among multiple threads in one or more processes. Our implementation is publicly available at:
http://deeplearning.googlecode.com


4 Evaluation 

4.1 Details of dropout training on the MNIST dataset

The architecture of a deep network (the number of layers and the number of hidden units in each
layer) varies across benchmark tasks. Our first step is to develop a prototype and evaluate its
effectiveness on the MNIST handwritten digits dataset [?], which consists of 28 × 28 digit images:
60,000 for training and 10,000 for testing. The objective is to classify the digit images into their
correct digit class. We experimented with a neural network of size 784-500-500-2000-10, and
pre-trained that network layer-wise as Restricted Boltzmann Machines (RBMs) for 50 epochs. Here
one epoch means one pass through the training data. During the pre-training phase, we employ a greedy
Contrastive Divergence-1 learning algorithm. The learning rate decays exponentially, with an
initial value of 10.0 and a decay rate of 0.998 for each epoch of training. Weights were updated at the end
of each mini-batch of size 100. Momentum is also used, with an initial value of 0.5 and a linear
increase of 0.001 per epoch over the first 490 epochs, after which it stays at 0.99. For the fine-tuning
using back-propagation with dropout for 200 epochs, we employed stochastic gradient descent
with mini-batches of size 100. We use the same dropout rate p for all hidden units and 20% dropout
for input pixels. A constant learning rate of 1.0 was used, and no constraints were imposed on the
norms of the weight vectors.
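
The learning-rate and momentum schedules described above can be written down directly. The sketch below is one reading of the text, with assumptions labeled: epochs are counted from 0, the decay is applied multiplicatively once per epoch, and the function names are hypothetical.

```python
def learning_rate(epoch, initial=10.0, decay=0.998):
    """Exponentially decaying learning rate: initial * decay ** epoch."""
    return initial * decay ** epoch

def momentum(epoch, initial=0.5, step=0.001, ramp_epochs=490, final=0.99):
    """Momentum rises linearly by `step` per epoch, then stays at `final`."""
    if epoch >= ramp_epochs:
        return final
    return initial + step * epoch

for epoch in (0, 1, 50, 200, 490):
    print(epoch, learning_rate(epoch), momentum(epoch))
```

Note that 0.5 + 0.001 × 490 = 0.99, so the linear ramp meets the final momentum value exactly at epoch 490.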

4.2 Generalization Performance as a function of dropout probability 

In the original dropout [13] article, Hinton et al. claim that better generalization performance can
be achieved with a variety of dropout probabilities. With the implementation details given in Section 4.1,
here we show how the test error rate varies as a function of dropout probability. As demonstrated in
Figure 4, dropout does decrease the test error rate by about 0.1% (10 fewer misclassified examples
in the test data set). Contrary to their claim, we found that the generalization performance of dropout
actually depends on the dropout probability. When the dropout probability is greater than 0.6, the
test error rate increases significantly. This inconsistency might be due to the much smaller number of
training epochs used in our implementation (1,000 epochs were used in Hinton's paper).

Figure 4: The test error rate on MNIST for a variety of dropout probabilities. The input visible
neurons also use 20% dropout. The best previously published results for this task using a DBN without
dropout are 1.18% (with RBM pre-training) and 1.6% (without RBM pre-training), using more than
1,000 epochs [13].


4.3 Generalization Performance of our distributed algorithm 

Here we evaluate the generalization performance on the MNIST dataset using various algorithms to
combine results from different machines, as listed in Section 3.2.


Algorithm               Error Rate (%)
Sequential              1.08
Weight Averaging        0.98
Majority Vote           1.04
Synchronous Update      0.97
Asynchronous Update     1.06


Table 1: Test error rate using different algorithms for 200 epochs of fine-tuning. The weight averaging
and majority vote algorithms collect final weights from 7 independent runs of the standard dropout
algorithm. The synchronous update and asynchronous update algorithms combine results from two
processes after each input instance. The dropout rate is 50% for all algorithms.


Both the weight averaging and synchronous update algorithms achieved a notable improvement in
generalization performance. Surprisingly, the majority vote method did not reduce the test error rate
by a large margin. The asynchronous update algorithm introduced additional noise into the weight
updates through its lock-free mechanism, so its generalization performance was roughly the same as
that of the sequential dropout algorithm. However, as shown in Figure 5, the convergence rate of both
the synchronous and asynchronous update algorithms is faster than that of the sequential dropout algorithm.
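
To put the error rates in Table 1 in absolute terms, the short calculation below converts them into counts of misclassified digits on the 10,000-image MNIST test set. The variable names are illustrative; the rates are copied from Table 1.

```python
TEST_SET_SIZE = 10_000  # number of MNIST test images

error_rates = {          # percentages from Table 1
    "Sequential": 1.08,
    "Weight Averaging": 0.98,
    "Majority Vote": 1.04,
    "Synchronous Update": 0.97,
    "Asynchronous Update": 1.06,
}

misclassified = {
    name: round(rate / 100 * TEST_SET_SIZE) for name, rate in error_rates.items()
}
baseline = misclassified["Sequential"]

for name, count in misclassified.items():
    print(f"{name:20s} {count:4d} errors, {baseline - count:3d} fewer than sequential")
```

Seen this way, the best method (synchronous update) corrects 11 of the 108 digits that sequential dropout misclassifies.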

4.4 Limitations 

Our current evaluation was performed on a desktop with a dual-core Intel E7400 processor, 3GB of
RAM, and an NVIDIA 8800GS graphics card. Pre-training and fine-tuning are generally very
time-consuming on this machine. Due to time constraints, we were only able to evaluate our proposed
algorithm on the relatively small MNIST dataset. However, we plan to further evaluate our
algorithm on other speech and object recognition benchmark tasks such as the TIMIT Acoustic-Phonetic
Continuous Speech Corpus [26], the Reuters Corpus for news article topic recognition [?], and the
ImageNet dataset of millions of labeled images in thousands of categories [28].

Figure 5: The test error rate on MNIST as a function of the number of epochs, using
synchronous/asynchronous updates. (Curves: Sequential Dropout, Synchronous Update, Asynchronous Update.)


5 Conclusion 

This paper proposes an approach to scale up deep belief networks (DBNs). At the core of our
approach is the use of random dropout to prevent co-adaptations on the training data for a DBN, reduce
overfitting, and enable DBN training to use the computational power of clusters in a distributed
environment. Empirically, our implementation outperforms some state-of-the-art approaches [?]
and promises nearly linear speedups. Furthermore, our approach allows parallel training of a DBN
even when some gradients are computationally intensive.

For future work, it would be interesting to compare our approach with other approaches using
different abstractions [?]. For example, the PowerGraph [20] abstraction exploits the internal
structure of graph programs to address the challenges of computation on natural graphs. Thus, it
may be possible to adapt a similar random dropout idea to reduce memory consumption and
communication between processors. An investigation into how to generalize this approach to other
structures and problems would enable even faster computation for machine learning problems.